Abstract

In the near future, all the human genes will be identified.
But understanding the functions coded in the genes is a
much harder problem. For example, by using block entropy,
one finds that the DNA code is closer to a random code than
written text is, which in turn is less ordered than ordinary
computer code; see Section 6.2.

Instead of saying that the DNA is badly written, 
using our programming standards, we might say that 
it is written in a different style — an evolutionary 
style. 

We will suggest a way to search for such a style 
in a quantified manner by using an artificial life pro- 
gram, and by giving a definition of general codes and 
a definition of style for such codes. 

1 Background 

As background, let us cite three different sources.
The first is a comment by J. Maddox (p. 376):

"In genetics for example, the task of under- 
standing the functions of all the 100,000 hu- 
man genes will require a much greater ef- 
fort than that involved in their identifica- 
tion, and by a factor 10 or more." 

Chris Adami from Caltech made, in his survey talk on
artificial life at Renaissance Technologies in Stony Brook
on 10/27/98, a brief remark about the quality of the evolved
program codes in his avida setup. He said something like this:

"The codes that are evolved will eventually 
be almost totally unreadable. Things are 
never used only once, but two or more times. 
It is a kind of a 'madman's' code." 

In recent years, many examples have been found where genes
have been "reused" for different purposes during development.
As an example, take the runt gene in Drosophila, which is
used in sex determination, segmentation and central nervous
system creation. The authors of one study of such reuse write:

"As mature organisms we are composed of 
an astonishing array of diverse cell types — 
all derived from a single-celled zygote. 
When faced with the task of generating such 
cellular diversity in a reproducible fashion, 
how has the embryo chosen to respond? Re- 
cent work in a number of developmental sys- 
tems has suggested that the embryo has em- 
ployed two approaches. First, given finite 
resources, the embryo has efficiently chosen 
to reutilize a limited set of proteins in differ- 
ent temporal and spatial contexts to create 
cellular diversity. Second, the embryo has 
also chosen to install molecular redundan- 
cies to ensure the reproducibility of these 
patterns from individual to individual." 

How can we capture these comments about the style or
quality of the computer code and the DNA in a quantified
manner? Can we do that in such a general manner that we
will be able to use analogous quality measures for both
carbon- and silicon-based genetic codes?

1.1 Plan of the paper 

We give a straightforward but general definition of a "code",
and a general definition of the "style" of such codes using a
given set of measures. We will also suggest measures of
characteristic features of such codes, such as some type of
"madness", the robustness, and the amount of reuse. Finally,
we will make some "baby-experiments": we run the avida
program and analyze samples of the evolved code, generate two
simulated programs in the same function class, and
stylistically compare the real evolved code with these
simulated codes. All of this will be done in a C++ program.
Some results will be graphically displayed at the end.

1.2 Future goals 

One could then run the avida program (or something similar)
more systematically and analyze the evolved codes
stylistically with more realistic, and a larger number of,
comparison codes. By changing parameters for the setup in
avida, one might eventually capture some common features. By
comparing the carbon-based programming style and the
silicon-based style, it might be possible to find some common
parts that would describe the natural programming style for
evolution. That would in turn help us read carbon-based code.
It would also give us some hints on how to create more robust
computer programs, by looking at how nature has solved such
problems.

1.3 Questions 

Is there an existing way to study the style or quality of
DNA? And is there any existing theory in the information
sciences that deals with this on the silicon side? We are
aiming at a complexity level higher than the usual
information theory measures, such as the notion of entropy
(see the comment in Section 6.2).

2 A general code

We will give a general definition of a code as a string 
of generalized "letters" from a given alphabet, that 
when interpreted, will define a function. This inter- 
pretation is not unique, i.e. many different codes will 
produce the same function. That fact will give us a 
way to study classes of codes. That is, two codes are 
in the same class if their interpretation gives equiva- 
lent functions. 

2.1 Codes 

We will give a definition of a code by using an underlying
alphabet and a generalized interpreter.

Definition 2.1 Let us define a code to be a finite string of
"letters" taken from an "alphabet", $A$,

$$\mathrm{Code} = \{a_i\}_{i=1}^{N}, \quad \text{where } a_i \in A,$$

such that when the code is interpreted, it will represent a
well defined function (or a process), $\mathrm{Code}_j \to f_j$,
with a domain $D_{f_j}$, such that every input $x \in D_{f_j}$
to the interpretation of the code will give $f_j(x)$ as the
output.^1

2.2 Classes of codes 

From the above definition of codes, we see that different
codes can have the same function representation; see
Example 2.5 below.

Let us therefore introduce the following classification. Let
the code $\mathrm{Code}_f$ and the codes $\mathrm{Code}_{f_i}$
have the functions $f$ and $f_i$ as their representations.

Definition 2.2 Let the code class with respect to the
function $f$ be the following (infinite) set of codes.

$$C_f = \{\mathrm{Code}_{f_i} : f_i(x) = f(x), \text{ for all } x \in D_f\}.$$

Remark 2.3 If the interpreter is not able to produce a well
defined function from the code Code, we say that Code is in
the error class. That (huge) class can be viewed as the
complement to all interpretable codes.



^1 Note that we here consider a function in a general sense,
i.e. not necessarily numerical.

Question 2.4 With the suggested norm of $C_f$ below, is
$C_f$ compact? Do there exist extremal codes?






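Example 2.5 Let $f$, $g$ and $h$ be three functions defined by
different expressions, partly piecewise (with cases such as
$x \neq -2$ and $x = -2$). (The displayed formulas are
illegible in this copy.)

Note that $\mathrm{Code}_g$ and $\mathrm{Code}_f$ are both in
$C_h$, and $C_f \subset C_g \subset C_h$.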


Remark 2.6 One common way to view functions is as black
boxes. Here we are interested in the internal structure of
such black boxes performing equivalent tasks.

3 The style of a code 

We will now describe a method to characterize different
coding styles. Since different codes in the same class do not
differ in function, we say that they differ in style. How can
we characterize such style?

Given a set of measures on the codes, we propose 
a way to characterize style of a subset of codes in the 
code class as an extremal unit weight vector that will 
act as a stylistic "fingerprint". We will also give an 
algorithm for "stylistic translations", and an index 
which reveals how well a given subset represents a 
common style. 

The methods we use rely on very elementary mathematics and
even more basic statistics. That will hopefully make them
accessible to a wide scientific audience interested in
"style".

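3.1 A measure on $C_f$

Let us now study measures on $C_f$. Let

$$\mu_k : C_f \to [0, 1], \quad k = 1, \ldots, n.$$

Let us consider the following profile measure

$$\mu = (\mu_1, \mu_2, \ldots, \mu_n)$$

of codes in $C_f$. If a given measure has a range outside
[0, 1], let us use a transformation to make it fit into
[0, 1].

We define the following scalar measure.

$$\nu_w(\mathrm{Code}_g) = w \cdot \mu(\mathrm{Code}_g),$$

where $w$ is a normalized weight vector such that
$\|w\| = 1$, for some norm $\|\cdot\|$. For example, let

$$\|w\|_p = \Big( \sum_{i=1}^{n} |w_i|^p \Big)^{1/p},$$

where $p \geq 1$. As a default norm, let us use
$\|\cdot\| = \|\cdot\|_2$.
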
3.2 Extremal weights 

Let us use the above measure to try to capture a 
characterization of "style" of codes. 

Suppose that we have two sample sets, A and B, 
of codes in Cf. We will try to find a style characteri- 
zation of the codes in A relative to B. 

We can think of $A$ as the set of codes we are stylistically
interested in and $B$ as a complementary environment.

Let us define a vector $u = (u_1, u_2, \ldots, u_n)$ in the
following way.

$$u = \sum_{a_i \in A} \sum_{b_j \in B} (\mu(a_i) - \mu(b_j)). \qquad (1)$$

Let us now normalize $u$ to a unit vector.

$$w^+ = w^+(A) = \frac{u}{\|u\|}. \qquad (2)$$



Let us now study the random variable

$$X = \nu_w(a_i) - \nu_w(b_j), \qquad (3)$$

where $a_i$ is a randomly chosen code in $A$, with uniform
probability^2, and $b_j$ is randomly chosen in $B$. Note that
$X$ is dependent on the chosen $w$.

Proposition 3.1 Let $X$ be the random variable defined in (3)
and let $w^+$ be the unit vector from (2). Then we have that
picking $w = w^+$ will maximize the expected value of $X$,
$E(X)$.

^2 The probability to choose $a_i$ is $1/\#(A)$.

Proof: Let $\#(\cdot)$ denote the cardinality of a set and let
$M = \#(A)\,\#(B)$. The expectation for a general weight
vector $w$ is easily computed.

$$E(X) = \frac{1}{M} \sum_{a_i \in A} \sum_{b_j \in B} w \cdot (\mu(a_i) - \mu(b_j)) = \frac{w \cdot u}{M}.$$

Since $\|w\| = 1$ we have that

$$|E(X)| \leq \frac{\|w\| \, \|u\|}{M} = \frac{\|u\|}{M}.$$

Let us now look at the special case $w = w^+$. We have then
that

$$E(X) = \frac{w^+ \cdot u}{M} = \frac{1}{M} \frac{u}{\|u\|} \cdot u = \frac{\|u\|^2}{M \|u\|} = \frac{\|u\|}{M}.$$

Thus we see that taking $w$ to be $w^+$ will maximize
$E(X)$. □
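
As a small numerical illustration of Proposition 3.1, the following sketch computes $u$ and $w^+$ from equations (1) and (2) and evaluates $E(X)$ for the default norm $\|\cdot\|_2$. The function names and the two sample sets are hypothetical; the profiles are assumed to be already scaled to [0, 1].

```python
import math

def fingerprint(profiles_a, profiles_b):
    """Return (u, w_plus) following equations (1) and (2), using the 2-norm."""
    n = len(profiles_a[0])
    u = [0.0] * n
    for a in profiles_a:
        for b in profiles_b:
            for k in range(n):
                u[k] += a[k] - b[k]
    norm = math.sqrt(sum(c * c for c in u))
    if norm == 0.0:
        raise ValueError("u is the zero vector, so no fingerprint is defined")
    return u, [c / norm for c in u]

def expected_x(w, profiles_a, profiles_b):
    """E(X) for X = nu_w(a_i) - nu_w(b_j), with a_i and b_j drawn uniformly."""
    total = 0.0
    for a in profiles_a:
        for b in profiles_b:
            total += sum(wk * (ak - bk) for wk, ak, bk in zip(w, a, b))
    return total / (len(profiles_a) * len(profiles_b))

# Hypothetical measure profiles (three measures per code).
A = [[0.9, 0.2, 0.5], [0.8, 0.3, 0.5]]
B = [[0.1, 0.7, 0.5], [0.2, 0.6, 0.4], [0.3, 0.8, 0.6]]

u, w_plus = fingerprint(A, B)
m = expected_x(w_plus, A, B)   # equals ||u|| / M up to rounding
print(w_plus, m)
```

Any other unit vector passed to `expected_x` should give a value no larger than `m`.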

3.3 A fingerprint on the hyper sphere 

We can now view the unit vector $w^+(A) = w^+$ on the unit
hypersphere^3 as a "fingerprint" of the style of codes in $A$
relative to $B$ with respect to the list of given measures
in $\mu$.

Remark 3.2 If there is a superset of codes $B_1 \supset B$
which is sufficiently separated, then the fingerprint $w^+$
of $A$ relative to $B_1$ can be expected to be a more refined
characterization of the "style" in $A$ than the $w^+$ of $A$
relative to $B$.

3.4 A universal character 

One would hope that similar classes, Cf, would give 
similar fingerprints w+ for different function codes 
created by the same code writing agent. 

3.5 Stability with respect to the vector $\mu$

There is a stability feature built into $w^+$ in the
following sense. If you would like to find a stylistic
quality in a group of codes, you try to find "relevant"
measures for the vector $\mu$. What happens if, in addition
to your relevant measures, you also take a sequence of
irrelevant measures (where, for example, the codes look more
or less randomly distributed, or even just the same)? If the
environment, i.e. $B$, is rich enough, then your profile will
just be about zero at the tail, where the non-relevant
measures are. That means you don't have to be restrictive
when you pick your measures: if some happen to be worthless,
that will take care of itself.

^3 If $n > 3$.

3.6 Is there a common style in A? 

Given two subsets of codes, $A$ and $B$ in $C_f$, we now have
a method for finding a common style in $A$, in relation to
$B$, as a unit vector $w^+(A)$. (As a special case, we can
let $B$ be the complement of $A$ in $C_f$. Then we can talk
about the style of $A$.)

This process can be executed even if $A$ and $B$ are just
randomly chosen subsets of $C_f$, where we cannot expect to
find any common stylistic features in $A$ in comparison to
the codes in $B$. How can we find out if $A$ really has a
common style, in comparison to $B$, that can be captured by
the chosen measure profile $\mu$?

Let us go back to Equation (3) and the random variable $X$.
Let us also define a similar random variable $Y$ such that

$$Y = \nu_w(c_i) - \nu_w(c_j),$$

where $c_i$ and $c_j$ are randomly chosen, with equal
probability, in $A \cup B$. Note that the underlying $w$ is
$w^+(A)$. From Proposition 3.1 above, the expected value of
$X$ will then be maximal. Let us denote that value by $m$,
i.e. let $E(X) = m$. From the proof of Proposition 3.1 we see
that

$$m = \frac{\|u\|}{M}.$$

Hence, loosely speaking, $m$ is large if $A$ has a
characteristic style that is captured by the measure profile
$\mu$. How large can $m$ get? Or in other words: how large
can $\|u\|$ get? Since all the measures $\mu_k(\cdot)$ are
bounded above by 1 and below by 0, we have that a component
$u_k$ of the $u$ vector is also bounded.



$$u_k = \sum_{a_i \in A} \sum_{b_j \in B} (\mu_k(a_i) - \mu_k(b_j)) \leq \sum_{a_i \in A} \sum_{b_j \in B} (1 - 0) = M.$$

If we are using the standard norm $\|\cdot\| = \|\cdot\|_2$,
we get that $\|u\|^2 \leq nM^2$ and hence
$\|u\| \leq \sqrt{n} M$. Let us normalize $m$ to get the
following index

$$\theta = \frac{m}{\sqrt{n}}.$$

We see that $\theta \in [0, 1]$ and it will be closer to 1
when the fingerprint is good.



On the other hand, due to Remark 3.3 below, we might also
need an invariant index.

Let us use the variances of the random variables $X$, $Y$ in
the following way. Let

$$\sigma_X^2 = E((X - m)^2), \quad \text{and let} \quad \sigma_{AB}^2 = E(Y^2).$$

Note that we immediately have that $E(Y) = 0$. Let us suggest
an index $\eta$ of how well the measure profile $\mu$ can
capture a common style, if it exists at all, in $A$ in
comparison to the codes in $B$. Let

$$\eta = \frac{\sigma_{AB}^2}{\sigma_X^2} = \frac{E(Y^2)}{E(X^2) - m^2}. \qquad (4)$$

Remark 3.3 Note that if all the measures in $\mu$ are
multiplied by a factor $k < 1$, i.e. $\mu_i \to k\mu_i$, then
$m \to km$, $\sigma_X^2 \to k^2 \sigma_X^2$, and
$\sigma_{AB}^2 \to k^2 \sigma_{AB}^2$. Hence $\eta$ would not
change, but $\theta \to k\theta$.
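
The two indices can be illustrated with a short sketch. It assumes that the scalar values $\nu_w$ have already been computed for every code, that $Y$ is formed from two independent uniform draws from $A \cup B$, and that equation (4) reads $\eta = \sigma_{AB}^2 / \sigma_X^2$. The function name and the sample values are hypothetical.

```python
import math
from itertools import product

def style_indices(nu_a, nu_b, n_measures):
    """Return (theta, eta) from the scalar measures nu_w of the codes in A and B."""
    # X = nu_w(a_i) - nu_w(b_j) over all equally likely pairs (a_i, b_j).
    x = [a - b for a, b in product(nu_a, nu_b)]
    m = sum(x) / len(x)
    var_x = sum((v - m) ** 2 for v in x) / len(x)
    # Y = nu_w(c_i) - nu_w(c_j) with c_i, c_j drawn independently from A u B.
    pool = list(nu_a) + list(nu_b)
    y = [c - d for c, d in product(pool, pool)]
    var_ab = sum(v * v for v in y) / len(y)     # E(Y) = 0, so this is E(Y^2)
    theta = m / math.sqrt(n_measures)
    eta = var_ab / var_x if var_x > 0 else math.inf
    return theta, eta

# Hypothetical scalar measures for codes in A and in B.
nu_a = [0.8, 0.7, 0.9]
nu_b = [0.2, 0.3, 0.1, 0.25]

theta, eta = style_indices(nu_a, nu_b, n_measures=3)
half_theta, half_eta = style_indices([0.5 * v for v in nu_a],
                                     [0.5 * v for v in nu_b], n_measures=3)
print(theta, eta, half_theta, half_eta)
```

Rescaling all values by $k = 0.5$ halves `theta` and leaves `eta` unchanged, as in Remark 3.3.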

3.6.1 Principal component analysis 

A very illustrative, and very popular, technique when looking
for connections in a multidimensional environment is
principal component analysis. It gives a two dimensional
diagram whose coordinate axes are the two eigenvectors with
the largest eigenvalues of the covariance matrix $M$. Let
$C = A \cup B$ and let $N = \#C$. We denote the codes in $C$
by $c_j$; then

$$M_{i,j} = \mathrm{Cov}\big( (\mu_i(c_1), \mu_i(c_2), \ldots, \mu_i(c_N)),\ (\mu_j(c_1), \mu_j(c_2), \ldots, \mu_j(c_N)) \big).$$

For more details, see any book on multivariate analysis.

If, in such a diagram, the codes in $A$ that we are
interested in are clearly separated from the $B$ codes, then
we could say that there is a common style in $A$. If that is
the case, and if the largest eigenvalue is considerably
larger than the second one, then the fingerprint $w$ and the
first eigenvector should be close to each other. That is
indeed the case in the figure showing our experiment.
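
The comparison between the fingerprint and the first principal component can be sketched as follows. The sketch uses plain power iteration instead of a linear algebra library, and uses the fact that $u$ in equation (1) is $M$ times the difference of the mean profiles of $A$ and $B$. All names and the sample data are hypothetical.

```python
import math

def covariance_matrix(profiles):
    """Sample covariance matrix of the measures over all codes in C = A u B."""
    count = len(profiles)
    n = len(profiles[0])
    means = [sum(p[k] for p in profiles) / count for k in range(n)]
    return [[sum((p[i] - means[i]) * (p[j] - means[j]) for p in profiles) / (count - 1)
             for j in range(n)] for i in range(n)]

def leading_eigenvector(matrix, iterations=200):
    """Unit eigenvector of the largest eigenvalue, found by power iteration."""
    n = len(matrix)
    v = [1.0 / (k + 1) for k in range(n)]       # arbitrary non-zero start vector
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in nxt))
        if norm == 0.0:
            raise ValueError("covariance matrix maps the start vector to zero")
        v = [c / norm for c in nxt]
    return v

def mean_profile(profiles):
    return [sum(p[k] for p in profiles) / len(profiles) for k in range(len(profiles[0]))]

# Hypothetical measure profiles (three measures per code).
A = [[0.9, 0.2, 0.5], [0.8, 0.3, 0.5]]
B = [[0.1, 0.7, 0.5], [0.2, 0.6, 0.4], [0.3, 0.8, 0.6]]

pc1 = leading_eigenvector(covariance_matrix(A + B))
diff = [a - b for a, b in zip(mean_profile(A), mean_profile(B))]
norm = math.sqrt(sum(d * d for d in diff))
w_plus = [d / norm for d in diff]               # same direction as u / ||u||
alignment = abs(sum(p * w for p, w in zip(pc1, w_plus)))
print(alignment)                                # close to 1 when A and B separate well
```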

3.6.2 Cluster analysis 

Another tool from the multivariate toolbox, cluster
analysis, could be used to study the question of a common
style in $A$. This method is based on an iterative fusion of
close points until the desired number of subsets is obtained.
To measure the closeness, we might pick our scalar measure
$\nu_w$.

In order to check if there is a common style in $A$, we can
ask how well the points in $A$ are clustered.



3.7 Style translations by iterations 

Using the above construction, let us indicate an algorithm
to "stylistically translate" a code $a$ in $A \subset C_f$
into the style of $B \subset C_f$.

1. Compute

   $$v = \sum_{b_i \in B} (\mu(b_i) - \mu(a)).$$

2. Find the dominating component of the vector $v$, i.e. let
   $m$ be such that

   $$|v_m| = \max_{1 \leq i \leq n} |v_i|.$$

3. Study the measure $\mu_m$ and stylistically rewrite $a$
   such that $\mu_m(a)$ would increase approximately $v_m$
   units (decrease if $v_m < 0$).

4. Iterate the process until sufficient accuracy is attained.
   The accuracy is measured by $\|v\|$.

Suppose the final accuracy is $\delta$ and denote the
rewritten $a$ by $a'$. Let $w = w^+(B)$ and let $Z$ be the
random variable

$$Z = \nu_w(b_i) - \nu_w(a'),$$

for a randomly chosen $b_i$ in $B$. Now the expected value of
$Z$ will be

$$E(Z) = \frac{1}{\#(B)} \sum_{b_i \in B} w \cdot (\mu(b_i) - \mu(a')),$$

where $\#(\cdot)$ stands for the number of elements in the
set. That is

$$E(Z) = \frac{w \cdot v}{\#(B)}.$$

Now if we have the norm $\|\cdot\|$ as the default
$\|\cdot\|_2$ we can use the Cauchy-Schwarz inequality to get
that

$$|E(Z)| \leq \frac{\delta}{\#(B)}.$$

In other words, the fingerprint $w^+$ of $B$ would hardly
"feel" the difference between $a'$ and the codes in $B$ if
$\delta$ is small.
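
The iteration can be sketched in measure space only. Rewriting a code is a manual, creative step, so the sketch below replaces step 3 by a stand-in: it assumes that a rewrite changes only the targeted measure and shifts it by the average gap $v_m / \#(B)$. That assumption, the function names and the data are all hypothetical.

```python
import math

def gap_vector(profile_a, profiles_b):
    """Step 1: v = sum over b in B of (mu(b) - mu(a))."""
    n = len(profile_a)
    return [sum(b[k] - profile_a[k] for b in profiles_b) for k in range(n)]

def translate(profile_a, profiles_b, tolerance=1e-6, max_steps=100):
    """Iterate steps 1-4 on the measure profile of a, returning the profile of a'."""
    a = list(profile_a)
    for _ in range(max_steps):
        v = gap_vector(a, profiles_b)
        if math.sqrt(sum(c * c for c in v)) <= tolerance:    # step 4: accuracy ||v||
            break
        m = max(range(len(v)), key=lambda k: abs(v[k]))       # step 2: dominating component
        # Step 3 stand-in (assumption): the rewrite moves mu_m(a) by the average gap.
        a[m] += v[m] / len(profiles_b)
    return a

# Hypothetical measure profiles.
a = [0.9, 0.2, 0.5]
B = [[0.1, 0.7, 0.5], [0.2, 0.6, 0.4], [0.3, 0.8, 0.6]]

a_prime = translate(a, B)
print(a_prime)   # approaches the mean profile of B under the stated assumption
```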

4 Different levels of Code 

So far we have just been studying a code on a single level.
In this section we will describe a way to separate the codes
into different levels. One could then ask questions about the
styles on different levels. Are the styles similar even on
different levels, etc.?

We will view a code as a composition of lower level codes.
Let us use the definition of emergence given by Baas on
p. 518:

P is an emergent property of $S^2$ if

$$P \in \mathrm{Obs}^2(S^2), \quad \text{but} \quad P \notin \mathrm{Obs}^2(S^1_{i_1}) \ \text{for all } i_1.$$

Let us explain this in more detail, and at the same time
display a concrete example of a computer code where the
inputs are integers.

In this case, let $S^0$ be the input values, e.g.
$S^0_i \in \mathbb{N}$, and let $\{\mathrm{Int}^i\}_{i=0}^{n}$
be a given sequence of sets of interactions, e.g.
$\mathrm{Int}^0 = \{+, -\}$,
$\mathrm{Int}^1 = \{*, /\} \cup \mathrm{Int}^0$,
$\mathrm{Int}^2 = \{=, <, >\} \cup \mathrm{Int}^1$, etc. Let
$\{\mathrm{Obs}^i\}_{i=0}^{n}$ be the related observational
functions, e.g. $\mathrm{Obs}^0(x)$ = value of $x$ as an
integer, $\mathrm{Obs}^3(x > y)$ = true or false.

From the given sequences $\{\mathrm{Int}^i\}_{i=0}^{n}$ and
$\{\mathrm{Obs}^i\}_{i=0}^{n}$ we get the higher order
structures as a "reaction", $R$, in the following manner.

$$S^1 = R(S^0, \mathrm{Obs}^0, \mathrm{Int}^0),$$
$$S^2 = R(S^1, \mathrm{Obs}^1, \mathrm{Int}^1),$$

see Baas for details.

We have then that in our example, $-1$ is an emergent
property of $S^1 = \mathbb{Z}$, $1/2$ is an emergent property
of $S^2 = \mathbb{Q}$, and "true" is an emergent property of
$S^3$, etc.
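
The first two levels of this example can be imitated in a few lines. This is only a toy reading of the "reaction" $R$: one round of applying every interaction to every pair of values, with "observable" simplified to set membership. The function names and the starting set are hypothetical.

```python
from fractions import Fraction

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

def mul(x, y):
    return x * y

def div(x, y):
    # Division by zero produces nothing rather than an error.
    return Fraction(x) / Fraction(y) if y != 0 else None

def react(values, interactions):
    """One 'reaction': apply every interaction to every ordered pair of values."""
    out = set(values)
    for x in values:
        for y in values:
            for op in interactions:
                result = op(x, y)
                if result is not None:
                    out.add(result)
    return out

s0 = {0, 1, 2}                           # a few natural numbers as inputs
s1 = react(s0, [add, sub])               # Int^0 = {+, -}
s2 = react(s1, [mul, div, add, sub])     # Int^1 = {*, /} together with Int^0

print(-1 in s0, -1 in s1)                            # -1 appears first at level 1
print(Fraction(1, 2) in s1, Fraction(1, 2) in s2)    # 1/2 appears first at level 2
```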

Given such a sequence $\{\mathrm{Int}^i\}_{i=0}^{n}$ we can
define a code of degree $k$ as a consecutive string of the
total code which has the property of $\mathrm{Obs}^k$, i.e.
if the code can produce an output that is an emergent
property of $S^k$.

We can then view the final code of degree $n$,
$\mathrm{Code}_n$, as a composition of sub-codes of degree
$n-1$, $\mathrm{Code}_{n-1}$, etc.

Note that Baas indicates this application when mentioning
the term hyper algorithms on p. 526.

How can one think of a good implementation of these
functions? The choice of the interactions Int will give us a
chance to find fine structures. What happens to our example
if we start with $\mathrm{Int}^0 = \{+\}$ and then
$\mathrm{Int}^1 = \{-\} \cup \mathrm{Int}^0$, etc.? Is there
a "natural" choice of that interaction sequence for a given
case?

5 Some measures 

Let us list a couple of measures that would be useful to
capture some of the features mentioned in Section 1 above,
and which also utilize the different levels of codes
discussed above.



5.1 Spaghetti 

Let us propose a kind of "code-madness-measure" using the
above hierarchies of codes.

Let $m_k$ be the maximum number of $\mathrm{Code}_{k-1}$
codes in a $\mathrm{Code}_k$. More precisely, let

$$\eta_k^i = \max\{j : \mathrm{Code}_{k-1}^j \subset \mathrm{Code}_k^i\},$$

and let

$$m_k = \max_i \eta_k^i \quad \text{and} \quad s_k = \sum_i \eta_k^i.$$

Now, we define the "spaghetti length" at level $k$ to be

$$S_k(\mathrm{Code}_n) = \frac{m_k}{s_k}, \quad \text{and} \quad S(\mathrm{Code}_n) = \max_k S_k(\mathrm{Code}_n).$$
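
A small sketch of the spaghetti length, reading the level-$k$ value as $m_k / s_k$ (an assumption, since the displayed formula is hard to read in this copy). The input maps each level $k$ to the list of counts $\eta_k^i$, i.e. how many $\mathrm{Code}_{k-1}$ subunits each $\mathrm{Code}_k$ unit contains. The representation and both sample programs are hypothetical.

```python
def spaghetti_level(subunit_counts):
    """S_k = m_k / s_k for one level, given the counts eta_k^i of each Code_k unit."""
    s_k = sum(subunit_counts)
    if s_k == 0:
        raise ValueError("a level needs at least one subunit")
    return max(subunit_counts) / s_k

def spaghetti(counts_by_level):
    """S(Code_n) = max over levels k of S_k(Code_n)."""
    return max(spaghetti_level(counts) for counts in counts_by_level.values())

# Hypothetical programs with 12 lowest-level subunits each.
modular = {1: [3, 3, 3, 3], 2: [2, 2]}   # four small units, grouped two by two
monolithic = {1: [12], 2: [1]}           # everything in one long unit

print(spaghetti(modular))      # 0.5
print(spaghetti(monolithic))   # 1.0
```

Under this reading, code split into small units of similar size gets a lower value than one long unit.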

5.2 Reuse 

Let $0 < k \leq n$ and

$$R_k^i(\mathrm{Code}_n) = \frac{\max \#\{\mathrm{Code}_{k-1} \subset \mathrm{Code}_k \text{ used } i \text{ times or more}\}}{s_k}.$$

Let us use the conventions i?^(Code) — i?fe(Code) 
and i?i(Code) = i?^(Code). 

5.3 Redundancy 

A very important feature of evolutionary driven codes is
their robustness. There are good measures of robustness given
in the literature. One way to measure it is to check the
probability that the code "survives" a one point mutation.

The redundancy in a code of degree $k$ could be measured in
the following way.

Let $m$ be the maximal number of subunit codes of type
$\mathrm{Code}_{k-1}$ in the code of type $\mathrm{Code}_k$
that can be taken away without affecting the output of the
$\mathrm{Code}_k$ code, and let $n$ be the total number of
subunits. Let us then define the redundancy in a code $C$ of
type $\mathrm{Code}_k$ by

$$\mathrm{Red}(C) = \frac{m}{n}.$$

5.4 Brittleness 

Another function that might be better to use in some ways is
a "brittleness" function that we define in the following way.

Let $m$ and $n$ be as above and let $d(k)$ be the number of
$\mathrm{Code}_{k-1}$ codes in a $\mathrm{Code}_k$ code that,
when removed, totally or partially destroy the code
$\mathrm{Code}_k$. Then we define the brittleness of
$\mathrm{Code}_k$ as

$$\mathrm{Britt}(\mathrm{Code}_k) = \frac{d(k)}{n - m}.$$

That can be viewed in the following way. Let us think of
$\mathrm{Code}_k$ as a chain and the subunits as links in
that chain. Then the length of the chain will be $n - m$ and
$d(k)$ will be the number of links that cannot be removed
without breaking the chain.

Note also that in the special case when we have either one
or two links for each step in the chain, $d(k) = n - 2m$.

Note that the probability of surviving a deletion of a
randomly chosen subcode is $1 - d(k)/n$. That is, if there
were just one sublevel to the code, $1 - d(k)/n$ would be the
usual robustness measure mentioned above.
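
The chain picture lends itself to a small sketch. It assumes the code is described by the number of interchangeable links at each step of the chain, so that $m$ counts the spare links and $d(k)$ counts the steps with a single link, and it reads the brittleness as $d(k)/(n - m)$. The representation and the sample chain are hypothetical.

```python
def chain_measures(links_per_step):
    """Redundancy, brittleness and survival probability for a chain of subunits."""
    if not links_per_step or min(links_per_step) < 1:
        raise ValueError("every step of the chain needs at least one link")
    n = sum(links_per_step)                          # total number of subunits
    m = sum(c - 1 for c in links_per_step)           # links removable without breaking the chain
    d = sum(1 for c in links_per_step if c == 1)     # links whose removal breaks the chain
    return {
        "n": n,
        "m": m,
        "d": d,
        "redundancy": m / n,            # Red(C) = m / n
        "brittleness": d / (n - m),     # Britt = d(k) / (n - m); n - m is the chain length
        "survival": 1 - d / n,          # probability of surviving one random deletion
    }

# Hypothetical chain: steps with 1, 2, 1 and 2 interchangeable links.
result = chain_measures([1, 2, 1, 2])
print(result)
```

For this chain $n = 6$, $m = 2$ and $d = 2 = n - 2m$, in agreement with the special case noted above.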

6 Five applications 

Since we are mainly interested in searching for an
evolutionary universal coding style, we are of course
interested in the two special cases when the code is a
computer code and when it is a sequence of DNA. We have tried
to make the above definitions of codes and style general
enough to deal with those cases.

As a by-product, we noted that this theoretical framework
could also be applied to other areas where 'style' is
essential. The applications that came closest to mind were
art, music and literature. As a third application we will
comment on how the above stylistic fingerprint could be used
as a common framework for "stylometry" in the study of
authorship attribution. Our hope is that if a theory is
applicable not only to the special cases it was constructed
for but also to a different case, it might be a sound
approach.

A fourth possible application would be a stylis- 
tic investigation of the internal architecture of black 
boxes in the theory of neural nets, and its carbon 
based version — the brain. 

The fifth application is about artificial life. We will 
make a small experiment in such an environment. 

6.1 Computer code 

By looking at the toy example indicated above, where we had
natural numbers as inputs and $\{+, -\}$ as the first
interactions, it is not too hard to imagine that one would
get the intuitive, "usual" increase in complexity of the
substructures in the code: from lower order arithmetic
operations to more complex functions, subroutines, program
parts, and the complete program.

Note that what is usually seen as good programming style,
that is, separating the code into small, more or less
independent units without too many jumps back and forth, will
give a low $S(\mathrm{Code})$; see Section 5.1.

Let us now mention three existing families of computer code
measures. See also [10] and [12] for a detailed description
of the first two examples and many others.

6.1.1 Halstead's Complexity Measures 

In 1977 M. Halstead introduced a tool to measure the 
complexity of a computer program. It is perhaps the 
most well known measure of that kind. 

The measure, or more precisely the family of five measures,
is based directly on the code in the following way. Let $n_1$
be the number of distinct operators, $n_2$ the number of
distinct operands, $N_1$ the total number of operators, and
$N_2$ the total number of operands. From these numbers, the
following measures are constructed.

Program vocabulary    $n = n_1 + n_2$
Program length        $N = N_1 + N_2$
Difficulty            $D = \frac{n_1 N_2}{2 n_2}$
Volume                $V = N \log_2(n)$
Effort                $E = DV$

Halstead's measures seem to be good candidates for measures
in the stylistic search, since they are based only on textual
information and since they have been used in many contexts
over a long time, so their properties are quite well known.
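
As a worked example of the five measures, the following sketch evaluates them for already tokenized code. How a language is split into operators and operands is a convention in itself; the split used for the sample statement is an assumption, and the function name is hypothetical.

```python
import math

def halstead(operators, operands):
    """Halstead's measures from the lists of operator and operand occurrences."""
    n1, n2 = len(set(operators)), len(set(operands))     # distinct operators / operands
    total_ops, total_opds = len(operators), len(operands)  # N_1 and N_2
    vocabulary = n1 + n2
    length = total_ops + total_opds
    difficulty = (n1 * total_opds) / (2 * n2)
    volume = length * math.log2(vocabulary)
    return {
        "vocabulary": vocabulary,
        "length": length,
        "difficulty": difficulty,
        "volume": volume,
        "effort": difficulty * volume,
    }

# The statement  z = x + x * y  split by hand (assumed convention).
measures = halstead(operators=["=", "+", "*"], operands=["z", "x", "x", "y"])
print(measures)
```

Here $n_1 = 3$, $n_2 = 3$, $N_1 = 3$ and $N_2 = 4$, so $n = 6$, $N = 7$ and $D = 2$.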

6.1.2 McCabe's Cyclomatic Complexity 

In 1976, T. McCabe introduced a measure of the 
number of linearly independent paths through a com- 
puter program as a measure of the complexity of the 
code. 

The cyclomatic complexity, CC, is computed in the following
way. Let us study a schematic graph of the program and count
the number of edges $E$, the number of nodes $N$, and the
number of connected components $c$. Then

CC = E - N + c.

If a program has a CC higher than 50 it is said to be
unstable, since it is then "very likely" to break down if it
is altered.

The cyclomatic complexity is more a measure of the inner
logical complexity than the textual Halstead measures are.

6.1.3 GRASP 

Let us describe an ongoing project at Auburn, Alabama, which
addresses a numerical local measure of computer codes and
its display. See the GRASP project's own material for further
information and for downloading the program.

"The overall goal of the GRASP project 
is to improve the comprehensibility of soft- 
ware. Thus, it is important to be able to 
identify complex areas of source code. The 
Complexity Profile Graph (CPG), a new 
graphical representation based on a com- 
posite of statement level complexity met- 
rics provides the user with the capability to 
quickly recognize complex areas of source 
code. The CPG is significant in that it 
shows the complexity of a program unit as a 
profile of statement level complexity metrics 
rather than as a single, global metric." 

First the program code is parsed into non-overlapping
segments; then a series of measures, briefly described below,
is applied to each segment, giving a local complexity measure
of the program code.

The content complexity of a segment $S$ in the code is
defined as

$$\eta(S) = \log\Big( \sum_{T \in S} \mathrm{Weight}(T) \Big),$$

where $T$ are the tokens in the segment $S$. For example, the
weights for Ada 95 are given in the table below.

Token description     Symbol                 Weight
Logical operators     and, or, not, ...      1.5
Comparison op.        <, >, =, <=, ...       1.5
Left parenthesis      (                      1.3
Identifiers           var1, proc1, ...       1.0
Others                +, -, *, /, ), ...     1.0
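
The content complexity can be sketched for a segment that has already been classified into the token categories of the table. The base of the logarithm is not stated above, so the natural logarithm is used here as an assumption; treating the keywords `if` and `then` as "other" tokens is likewise an assumption, and the names are hypothetical.

```python
import math

# Weights from the table above (Ada 95).
WEIGHTS = {
    "logical": 1.5,
    "comparison": 1.5,
    "left_paren": 1.3,
    "identifier": 1.0,
    "other": 1.0,
}

def content_complexity(token_kinds):
    """Logarithm (natural, by assumption) of the summed token weights of a segment."""
    if not token_kinds:
        raise ValueError("a segment needs at least one token")
    return math.log(sum(WEIGHTS[kind] for kind in token_kinds))

# Hand-classified tokens of the segment:  if (A < B) and Ok then
segment = ["other", "left_paren", "identifier", "comparison", "identifier",
           "other", "logical", "identifier", "other"]

value = content_complexity(segment)
print(value)   # log of the weight sum 10.3
```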



"The context complexity provides a baseline 
level of complexity for segments of simple 
statements nested within a compound state- 
ment, which itself may be nested several lev- 
els deep. The complexity of a compound 
statement is based on three aspects: inher- 
ent complexity, reachability, and breadth." 

These three complexities are added, with weights, 
to obtain the context complexity. 



Combining the content complexity and the context complexity,
by a weighted sum, gives the profile metric $\mu(S)$ for a
segment $S$.

The local metric $\mu$ can then be presented graphically as a
histogram to indicate where the code is more complex. That
would help a programmer to point out code segments that need
some extra thought.

6.1.4 Finding a coding style 

Suppose that we would like to find a way to assign 
a specific style to a programmer. How could we do 
that? 

Suppose he writes in C++. Then one might take a sample of his
codes and look at his functions and subroutines. Then go to
other programmers, in the environment where one would like to
be able to identify "our" programmer in the future, and
collect as many functions as possible from them.

Sort the codes into function classes $C_f$, where our
programmer has written $a_f \in C_f$ and the other
programmers $b_{f_j} \in B \subset C_f$.

Let us now take a wide variety of measures of computer codes,
such as those mentioned above, e.g. Halstead's measures etc.
Let us call the vector of those measures $\mu$, where each
$\mu_k$ is a function from C++ codes to the unit interval
[0, 1].

Let us use equation (1). Now,

$$w^+ = \frac{u}{\|u\|}$$

will be the stylistic fingerprint of this programmer with
respect to the vector of measures $\mu$ and in the chosen
environment. Note that $w^+$ will be heavily dependent on the
chosen reference codes, so it is important that the
environment is chosen carefully.

6.2 DNA 

It is hard to get a handle on the DNA code with the usual
computer code tools. As an example, the authors of one study
use block entropies for DNA to find some indication of
structure.

Definition 6.1 Let the length of the alphabet be $\lambda$.
Then the (normalized) $n$-block entropy of a sequence is
defined as

$$H_n = -\sum_{i=1}^{\lambda^n} p_i^{(n)} \log_\lambda p_i^{(n)},$$

where $p_i^{(n)}$ is the probability of the $i$th combination
of $n$ "letters".

In the summary of that study one can read the following.

"Surprisingly, DNA sequences behave closer 
to completely random sequences than to 
written text. The very strict syntax of com- 
puter languages on the other hand is re- 
flected by a very low average information 
content of its sub-strings." 

To us that means that DNA sequences are not merely close to
chaos, but rather written in a style quite different from
that of written text, and even more different from that of
computer codes.

The inputs $S^0$ here are the amino acids, and one could
perhaps take $\mathrm{Int}^0$ to be "putting next to", or
sequencing.

The letters on the DNA level should be the four bases; the
words are then three-letter words, each coding for an amino
acid. Add to the dictionary special start and stop words for
the genes.

One could also study the situation on the amino acid level,
i.e. one level up from the bare DNA code. That would give us
20 letters in the alphabet.^4 The words would here be the
genes.

6.2.1 Hopes 

One would hope that some hypothetical insight of a 
universal evolutionary driven style would give some 
hints how to better "read" the DNA code. 

We would therefore try to find measures that are 
general enough in the below described computer ex- 
periment, so that some hypothesis about the DNA 
code could eventually be made. 

There are indications that evolutionary driven code may not
be "optimal" in a basic sense, but might include peculiar
turns and twists. See for example p. 180 in [11], where
J. Maddox discusses RNA editing:

"The puzzle is to know why these changes, 
which are presumably advantageous to the 
organism, have not been incorporated in 
the gene themselves, thus avoiding the need 
for editing by way of afterthought — not to 
mention the need for a separate biochemical 
mechanism for carrying it out." 

He also addresses, on p. 203, the "junk code" in 
eukaryotic cells. 



"At the very least, this complication is an 
extra metabolic cost for eukaryotic cells. It 
is also potentially a source of error. What 
countervailing selective advantage can there 
possibly be in this arrangement?" 

6.3 Literature 

Let us now turn our attention to something differ- 
ent. Let us look at some examples in literature where 
"style" has been in the focus. 

One author discusses (on p. 74) the computational
differences between grammatical errors and stylistic
weaknesses.

Elsewhere, seven important problems with the existing
authorship attribution studies are listed and discussed, and
some solutions are proposed. The proposed solution to problem
number three is to

"study style in its totality. Approximately 
1,000 style markers have already been iso- 
lated. We must strive to identify all of the 
markers that make up "style" — to map 
style the way biologists are mapping the 
genes." 

Furthermore, the suggested solution to problem five 
is to 

"Develop a complete and necessar- 
ily multi-faceted theoretical framework on 
which to hang all non-traditional authorship 
attribution studies. 

Publish the theories, discuss the theo- 
ries, and put the theories to experimental 
tests." 

As an example of a metric suggested in the literary studies,
one can take Yule's coefficient, advocated by many authors.
Let $\{f_{ij}\}$ represent the observed frequencies in a "two
way contingency table" and let Yule's coefficient be defined
as

$$Y = \frac{\sqrt{c} - 1}{\sqrt{c} + 1}, \quad \text{where } c = \frac{f_{11} f_{22}}{f_{12} f_{21}}.$$

^4 The alphabet is not unique; for example, some bacteria and
humans use different spellings for some amino acids.




6.3.1 Stylometry: Finding the author 

Let us now treat a hypothetical case, using the methods from
Section 3, on the authorship attribution problem.

What do we mean by a code in this case? Let us look at
Definition 2.1. We will interpret the "letters" as words
taken from the "alphabet", which will in this case be a
complete dictionary of the language in question. What about
the functions?

Scarry gives in [16] a description of how we can think about
the reading process; see the following quote from the first
chapter:

"When we say 'Emily Bronte describes 
Catherine's face,' we might also say 'Bronte 
gives us a set of instructions for how to 
imagine or construct Catherine's face.' This 
reformulation is accurate if cumbersome, in 
that it shifts the site of mimesis from the 
object to the mental act." 

So, in this case we might think of the functions
as descriptions, or simply constant functions,
where the interpreter is the reader.^ Examples of
interpreted functions would be texts where "boy meets
girl" or "prince meets ghost". An even more refined
version would be "boy meets girl described in an
English sonnet".

Suppose now that we want to test the hypothesis
that author X has written a given sonnet s. Then we
might gather all known sonnets written by X into the
set A, as a subset of the reference sonnets from that
time period in S. Now a_i ∈ A means sonnet number i
in A.

One wants the environment S to be as large as
possible, but at the same time as narrow as possible,
i.e. from the same time period, etc. It therefore has to
be chosen carefully and with great knowledge of
the literary period and its authors. Let now B be
the set of comparison sonnets.

On the other hand, when it comes to choosing
metrics, we pick as many as are numerically
computable in reasonable time, which depends both on
our time and on our computer. For example, there have
been some interesting new developments using neural
networks in authorship attribution; see for example
[?]. An even more exciting method was used in [?],
where genetic algorithms were used to find
the best features to measure. We can use such nets
and genetically derived measures in our set of
measures too!

Now, apply equation (1) to get the vector u, which
is then normalized to w+. That will be the stylistic
fingerprint of author X, given the framework
constructed above.

How good and reliable is this fingerprint? Suppose
that A is large and S is rich, not only large but more
or less complete with respect to the author
representation. Then η from equation (?) would be a good
measure of how reliable w+ is. We want η to be large,
of course, but what is large enough?

^Here it is easy to see the crucial importance of the
interpreter. I would, for example, be an extremely poor
interpreter of a French text, even if I would get some
picture of a story at the end.

To answer that, we need to study the actual
distributions of the sonnets evaluated by w+, make
approximations and perform hypothesis tests.

The nice thing about this method is that one does 
not have to argue which measure should be applied. 
Just apply them all! 

Talking about measures, there has been a very
animated debate about the value of using computers in
literary studies. R. Quiones said in [?]: "Why don't
they simply read the plays?" A literature expert
reading a text is undoubtedly very hard to beat when
it comes to authorship attribution. But let us look
at this situation with our suggested definition of
measures of codes in mind. Isn't the expert using a
long array of measures, weighted together in an
intricate and sometimes subconscious way? One
measure could be: "X would never use that word in two
consecutive sentences". The measure would be the
characteristic function of that event in the text, and
the weight would be heavily negative. The more skill
and familiarity the expert has with such tasks, the
larger the array μ, and the more subtle the weighting
process would be. It may seem like a trivialization to
think in those terms, but since the task is very hard
and complex, one would not expect a computer to be
programmed in the near future that is generally
better than a literature expert. Compare this with the
relatively straightforward problem of playing chess.
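The expert's rule about consecutive sentences can be written down as a characteristic function. The following is a minimal sketch in Python, assuming a deliberately naive sentence splitter and tokenizer. All names, the example word and the weight -5.0 are hypothetical illustrations and are not taken from any of the cited studies.

```python
import re


def sentences(text):
    """Split text into sentences on '.', '!' or '?' (a naive rule)."""
    parts = re.split(r"[.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def words(sentence):
    """Set of lower-cased word tokens in one sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def repeats_in_consecutive_sentences(text, word):
    """Characteristic function of the event
    'word occurs in two consecutive sentences': returns 1 or 0."""
    word = word.lower()
    token_sets = [words(s) for s in sentences(text)]
    for first, second in zip(token_sets, token_sets[1:]):
        if word in first and word in second:
            return 1
    return 0


def weighted_score(text, measures):
    """Sum of weight * measure(text) over (measure, weight) pairs."""
    return sum(weight * measure(text) for measure, weight in measures)


# One measure with a heavily negative (hypothetical) weight.
measures = [(lambda t: repeats_in_consecutive_sentences(t, "moon"), -5.0)]
print(weighted_score("The moon rose. The moon set. Stars stayed.", measures))
```

A text that triggers the event is penalized by the full weight, while a text that avoids it contributes nothing. An expert's judgment would then correspond to a long list of such pairs.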

6.4 Neural nets 

P. Adams in [?] pointed out the possibility of strong
evolutionary forces acting inside the brain in the
process of learning. "Good" synapses will be rewarded
by being strengthened, and "bad" ones will be punished
by being weakened. What is good and what is bad are
much more intricate and implicit qualities than in
genetic evolution, but maybe evolutionary forces are
in command in the learning process on a time scale
of seconds instead of millennia. And if there is a
universal coding style, it should then also be found in
the style of the programming of the neural nets.

6.5 Artificial life 

Avida, see [?] and [?], is a program in the Tierra
class, see [18]. Unlike Tierra, it gives a natural
two-dimensional picture of an evolutionary process. Like
Tierra, Avida is not a simulation of real carbon-based
DNA, but a real evolution in a (simple) silicon-based
world. We will now do an experiment in the Avida
world.

7 A small experiment

Let us try some of the above concepts in an Avida
experiment. We will deliberately make it as
simple as possible in order to get some output in a
straightforward way, but we will keep all the doors
open for variants and generalizations in the future.

We use Avida version 1.0.1 that is available on the
web; see

When running the program, you can extract
individuals and save their data and
their genetic code. To read more about this, see
the documentation on or, even better, the book
[?]. The code and the data are saved to files such as
153-aagxs.

The data in 153-aagxs tells you, for example, that
in addition to being able to replicate itself, the code
also performs some other tasks. In this specific case,
it takes input from a stack, performs a logical XOR
twice, performs three NOTs, etc. For this, it is
rewarded with a bigger time slice, and hence will
reproduce (and survive) better.

7.1 The function class

As the function f defining the code universe
C_f, let us take a function that performs exactly the
above-described set of tasks, e.g. three NOTs, etc.

As the subspace A we simply take the generated
Avida code in 153-aagxs.

7.2 Comparison codes

To get a comparison environment, let us construct
(or simulate) two man-made codes that perform the
operations in f.

We use a very simple approach and simulate two
different codes in B: one with no loop except for
the self-copying loop, and one with as many loops as
there are logical tasks to be done. Furthermore, we
do not actually write the codes but simulate the
writing, using a very rigid approach of building all the
logical statements from combinations of NAND^ and
then studying in detail the operations needed for just
one NAND. This approach gives us a list of the
operations and operands needed to fulfill f. We can now
try to look for style.
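The idea of expressing every logical task through NAND can be illustrated with a minimal sketch in Python. It works on single bits and uses the standard textbook constructions. It does not model the Avida instruction set or the authors' C++ simulation, and all names are hypothetical.

```python
class NandCounter:
    """Evaluates NAND on single bits and counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def nand(self, a, b):
        self.calls += 1
        return 1 - (a & b)


def not_(g, a):
    return g.nand(a, a)                        # 1 NAND


def and_(g, a, b):
    return not_(g, g.nand(a, b))               # 2 NANDs


def or_(g, a, b):
    return g.nand(not_(g, a), not_(g, b))      # 3 NANDs


def xor_(g, a, b):
    t = g.nand(a, b)                           # shared intermediate
    return g.nand(g.nand(a, t), g.nand(b, t))  # 4 NANDs in total


def nand_cost(op, arity):
    """Number of NAND evaluations used by one call of op."""
    g = NandCounter()
    op(g, *([1] * arity))
    return g.calls


# Cost of two XORs and three NOTs, counted in NAND evaluations.
print(2 * nand_cost(xor_, 2) + 3 * nand_cost(not_, 1))
```

With these constructions, the two XORs and three NOTs mentioned for 153-aagxs alone would need 2 * 4 + 3 * 1 = 11 NAND evaluations. A man-made comparison code then mainly differs in whether the operations of a single NAND are written out each time or wrapped in a loop.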



7.3 The measure 

To start with something, we used Halstead's
measures as our μ. That is, let

μ = (vocabulary, length, difficulty, volume, effort).
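For reference, the five components can be computed from four counts: the number of distinct operators n1, the number of distinct operands n2, the total number of operators N1 and the total number of operands N2. The sketch below, in Python, uses the standard Halstead definitions. We assume, but cannot confirm from the text alone, that the C++ program uses the same ones. The function name is hypothetical.

```python
import math


def halstead(n1, n2, N1, N2):
    """Standard Halstead measures from operator and operand counts.

    n1, n2: number of distinct operators and operands
    N1, N2: total number of operators and operands
    """
    if n2 == 0:
        raise ValueError("difficulty is undefined without operands")
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2.0) * (N2 / n2)
    effort = difficulty * volume
    return {
        "vocabulary": vocabulary,
        "length": length,
        "difficulty": difficulty,
        "volume": volume,
        "effort": effort,
    }


# Counts as reported for 153-aagxs in Section 7.5.
mu = halstead(19, 3, 153, 31)
print(mu)
```

With the counts reported for 153-aagxs in Section 7.5, these definitions give difficulty 98.1667, volume 820.535 and effort 80549.2 to the printed precision, which agrees with the output listed there.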



7.4 Simulations 

The actual simulations are done by extracting
creatures, i.e. individual programs, from a run of the
Avida program, as described above. The extraction
consists of some data, the table of performance, and
also the program itself. We then use a C++ program
to automatically analyze the extraction by first
reading the data and the performance table, and then
finding the Halstead measure vector μ for the actual
code. The C++ program then also uses the
performance table to simulate the two different comparison
codes described in Section 7.2 above and to calculate
their respective Halstead measures. The program
then also computes w when A contains only the code of
the extracted individual and B consists of the two
comparison codes. (w for other combinations of A and
B is also considered.) The results are then exported
to a Maple file, where one can more easily work with
the output.

7.5 Preliminary results 

Since #A is one and #B is two, we cannot of course
talk about results in any statistical sense.
Nevertheless, we can present some outcome that should
be seen as an indicator of what eventually can be done.
Even if we have a very small C_f, since essentially every
creature represents a unique list of performed tasks,
which leads to an essentially unique function f, we
have many classes C_f, and we are free to change
parameters such as the initial random seed, the reward
table for the tasks, etc. We can therefore compare
profiles; see for example Figure 2.

Let us see in more detail what kind of output you
can get by going back to our old friend 153-aagxs,
born after 19441 generations.

After feeding the file 153-aagxs into the C++
program, which dissects the code and simulates the two
comparison codes, we get output such as the
Halstead measures:

n_1 = 19, n_2 = 3, N_1 = 153, N_2 = 31
difficulty = 98.1667
volume = 820.535
effort = 80549.2.

^This is the base for the default reward list for completed
tasks in time slicing; see the file task. set in the Avida package.

And we also get

e = 0.0085549 and η = 391865,

which does not really tell you much in this meager
situation (η is huge since the variance of X is so
extremely tiny for this special case). We can view w+
in a diagram; see Figure 3. (Remember that we
normalized the measures into [0, 1] by the
transformation in order to put the weights in w on
the same scale.)




Figure 1: Here is an example of a fingerprint after
about 1000 generations using Halstead's complexity
measures as μ.




Figure 2: Here are seven fingerprints after about
8500 generations. Can we hope for some convergence?
And if so, what will that tell us?




Figure 3: Here is w+ when A is the single
Avida-generated code 153-aagxs and B consists of the
no-loop code and the all-loop code, both based entirely on
NAND combinations.



Figure 5: Here A is 153-aagxs and B is the all-loop
code.



Figure 7: Here is a picture of a principal component
analysis, see Section 3.[?], of the μ vector for the three
different codes connected to 153-aagxs. The letter
A stands for the Avida code and the letters N and L
for the no-loop code and the loop code. We see that
the two simulated codes, N and L, lie closer together,
indicating that they are more similar to each other
than to the evolutionarily generated code from Avida.